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Abstract Photon-limited imaging arises when the 
number of photons collected by a sensor array is small 
relative to the number of detector elements. Photon 
limitations are an important concern for many appli- 
cations such as spectral imaging, night vision, nuclear 
medicine, and astronomy. Typically a Poisson distri- 
bution is used to model these observations, and the 
inherent heteroscedasticity of the data combined with 
standard noise removal methods yields significant arti- 
facts. This paper introduces a novel denoising algorithm 
for photon-limited images which combines elements of 
dictionary learning and sparse patch-based representa- 
tions of images. The method employs both an adapta- 
tion of Principal Component Analysis (PCA) for Pois- 
son noise and recently developed sparsity-regularized 
convex optimization algorithms for photon-limited im- 
ages. A comprehensive empirical evaluation of the pro- 
posed method helps characterize the performance of 
this approach relative to other state-of-the-art denois- 
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ing methods. The results reveal that, despite its concep- 
tual simplicity, Poisson PCA-based denoising appears 
to be highly competitive in very low light regimes. 

Keywords Image denoising • PCA • Gradient 
methods • Newton's method • Signal representations 



1 Introduction, model, and notation 

In a broad range of imaging applications, observations 
correspond to counts of photons hitting a detector ar- 
ray, and these counts can be very small. For instance, in 
night vision, infrared, and certain astronomical imaging 
systems, there is a limited amount of available light. 
Photon limitations can even arise in well-lit environ- 
ments when using a spectral imager which character- 
izes the wavelength of each received photon. The spec- 
tral imager produces a three-dimensional data cube, 
where each voxel in this cube represents the light in- 
tensity at a corresponding spatial location and wave- 
length. As the spectral resolution of these systems 
increases, the number of available photons for each 
spectral band decreases. Photon-limited imaging algo- 
rithms are designed to estimate the underlying spa- 
tial or spatio-spectral intensity underlying the observed 
photon counts. 

There exists a rich literature on image estimation 
or denoising methods, and a wide variety of effective 
tools. The photon-limited image estimation problem is 
particularly challenging because the limited number of 
available photons introduces intensity-dependent Pois- 
son statistics which require specialized algorithms and 
analysis for optimal performance. Challenges associated 
with low photon count data are often circumvented 
in hardware by designing systems which aggregate 
photons into fixed bins across space and wavelength 
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{i.e., creating low-resolution cameras). If the bins are 
large enough, the resulting low spatial and spectral 
resolution cannot be overcome. High-resolution obser- 
vations, in contrast, exhibit significant non-Gaussian 
noise since each pixel is generally either one or zero 
(corresponding to whether or not a photon is counted 
by the detector), and conventional algorithms which ne- 
glect the effects of photon noise will fail. Simply trans- 
forming Poisson data to produce data with approximate 
Gaussian noise (via, for instance, the variance stabiliz- 
ing Anscombe transform plIST] or Fisz transform [T3 
[T8] ) can be effective when the number photon counts 
are uniformly high [5ll46]. However, when photon counts 
are very low these approaches may suffer, as shown later 
in this paper. 

This paper demonstrates how advances in low- 
dimensional modeling and sparse Poisson intensity re- 
construction algorithms can lead to significant gains in 
photon-limited (spectral) image accuracy at the reso- 
lution limit. The proposed method combines Poisson 
Principal Component Analysis (Poisson-PCA - a spe- 
cial case of the Exponential-PC A |lQ[ l4Qj) and sparse 
Poisson intensity estimation methods [20^ in a non-local 
estimation framework. We detail the targeted opti- 
mization problem which incorporates the heteroscedas- 
tic nature of the observations and present results im- 
proving upon state-of-the-art methods when the noise 
level is particularly high. We coin our method Pois- 
son Non-Local Principal Component Analysis (Poisson 
NLPCA). 

Since the introduction of non-local methods for im- 
age denoising [8], these methods have proved to out- 
perform previously considered approaches [T1[TTJ[30[|12] 
(extensive comparisons of recent denoising method can 
be found for Gaussian noise in [2T|[26]). Our work is in- 
spired by recent methods combining PGA with patch- 
based approaches [33,47|IT5] for the Additive White 
Gaussian Noise (AWGN) model, with natural exten- 
sions to spectral imaging [T^. A major difference be- 
tween these approaches and our method is that we di- 
rectly handle the Poisson structure of the noise, without 
any "Gaussianization" of the data. Since our method 
does not use a quadratic data fidelity term, the singu- 
lar value decomposition (SVD) cannot be used to solve 
the minimization. Our direct approach is particularly 
relevant when the image suffers from a high noise level 
{i.e., low photon emissions). 



1.1 Organization of the paper 

In Section |1.2[ we describe the mathematical frame- 
work. In Section [2j we recall relevant basic properties 



of the exponential family, and propose an optimiza- 
tion formulation for matrix factorization. Section |3] pro- 
vides an algorithm to iteratively compute the solution 
of our minimization problem. In Section [5j an impor- 
tant clustering step is introduced both to improve the 
performance and the computational complexity of our 
algorithm. Algorithmic details and experiments are re- 
ported in Section [6] and [7| and we conclude in Section 

m 

1.2 Problem formulation 

For an integer M > 0, the set {1,...,M} is denoted 
|1, M]. For i G |1, M], let yi be the observed pixel val- 
ues obtained through an image acquisition device. We 
consider each i/i to be an independent random Poisson 
variable whose mean /i > is the underlying intensity 
value to be estimated. Explicitly, the discrete Poisson 
probability of each yi is 

nVilfi) = ^e-f\ (1) 

where 0! is understood to be 1 and 0^ to be 1. 

A crucial property of natural images is their ability 
to be accurately represented using a concatenation of 
patches, each of which is a simple linear combination of 
a small number of representative atoms. One interpre- 
tation of this property is that the patch representation 
exploits self-similarity present in many images, as de- 
scribed in AWGN settings [TT[l3Ql[T2] . Let Y denote the 
M X N matrix of all the vectorized a/TV x overlap- 
ping patches (neglecting border issues) extracted from 
the noisy image, and let F be defined similarly for the 
true underlying intensity. Thus Yij is the jth pixel in 
the ith patch. 

Many methods have been proposed to represent the 
collection of patches in a low dimensional space in the 
same spirit as PGA. We use the framework considered 
in p^ [4Q]. that deals with data well- approximated by 
random variables drawn from exponential family distri- 
butions. In particular, we use Poisson-PCA, which we 
briefiy introduce here before giving more details in the 
next section. With Poisson-PCA, one aims to approxi- 
mate F by: 

Fij ^ exp{[UV]ij) V(i,j) G II, Ml X [l.TV], (2) 
where 

— U is the M X i matrix of coefficients; 

— y is the i X N matrix representing the dictionary 
components or axis. The rows of V represents the 
dictionary elements; and 
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— ex.p{UV) is the element-wise exponentiation of UV: 
exp{[UV]ij) := [exp(C/y)]^_^.. 

The approximation in ([2| is different than the approxi- 
mation model used in similar methods based on AWGN, 
where typically one assumes Fij ^ [UV]ij (that is, 
without exponentiation) . Our exponential model allows 
us to circumvent challenging issues related to the non- 
negativity of F and thus facilitates significantly faster 
algorithms. 

The goal is to compute an estimate of the form ([2| 
from the noisy patches Y. We assume that this approx- 
imation is accurate for i <C M, whereby restricting the 
rank £ acts to regularize the solution. In the following 
section we elaborate on this low-dimensional represen- 
tation. 

2 Exponential family and matrix factorization 

We present here the general case of matrix factorization 
for an exponential family, though in practice we only 
use this framework for the Poisson and Gaussian cases. 
We describe the idea for a general exponential family 
because our proposed method considers Poisson noise, 
but we also develop an analogous method (for compar- 
ison purposes) based on an Anscombe transform of the 
data and a Gaussian noise model. The solution we fo- 
cus on follows the one introduced by [10]. Some more 
specific details can be found in [4Ql l39] about matrix 
factorization for exponential families. 



The parameter ^ G 6^ is called the natural parameter 
of V and the set O is called the natural parameter 
space. The function <P is called the log partition func- 
tion. We denote by 'Eo[-] the expectation w.r.t. po: 

Ee[g{X)]= [ g{y){exp{{0\cj,{y)))-m)dp{y). 

JX 

Example 1 Assume the data are independent (not nec- 
essarily identically distributed) Gaussian random vari- 
ables with means fii and (known) variances a^. Then 
the parameters are: My G W^^(j){y) = ^{6) — 
YTi^i ^l/2a2 and V^(^) = (^i/a^, • • • , O^/a^) and v is 
the Lebesgue measure on (c/. ^] for more details on 
the Gaussian distribution, possibly with non-diagonal 
covariance matrix). 

Example 2 For Poisson distributed data (not necessar- 
ily identically distributed), the parameters are the fol- 
lowing: G W,^{y) = y, and ^{0) = (exp(^)|l,) = 
Yl7=i where exp is the component-wise exponential 
function: 

exp:(^i,--- ,^^)^(e^S--- ,e^-), (5) 

and In is the vector (!,••• ,1)^ G M^. Moreover 
V^{9) = exp(^) and u is the counting measure on N 
weighted by e/n\. 

Remark 1 The standard parametrization is usually dif- 
ferent for Poisson distributed data, and this family is 
often parametrized by the rate parameter / = exp(6>). 



2.1 Background on the exponential family 

We assume that the observation space y is equipped 
with a cr-algebra B and a dominating cr-finite measure 
V on (3^, B). Given a positive integer n, let (j): y 
be a measurable function, and let ^/e, k = 1,2, ••• ,n 
denote its components: (j){y) = (^i(^), • • • , 0n(^))- 

Let O be defined as the set of all G such that 
J-y exp((^ I (/)(?/) ))(ii^ < oo. We assume it is convex and 
open in this paper. We then have the following defini- 
tion: 

Definition 1 An exponential family with sufficient 
statistic (j) is the set 7^(^) of probability distributions 
w.r.t. the measure v on (3^, B) parametrized hy 6 e 
such that each probability density function po G 7^(0) 
can be expressed as 

pe{y)=exp{{0\cj>{y))-m}, (3) 
where 

^(^)=log / exp{(^|(/>(y))}d%). (4) 



2.2 Bregman divergence 

The general measure of proximity we use in our analysis 
relies on Bregman divergence [7 . For exponential fami- 
lies, the relative entropy (Kullback-Leibler divergence) 
between p^i^ and po^ in 7^(^), defined as 

D^{Pei\\pe2)= / Pe^^og{peJpe2)di^^ (6) 

can be simply written as a function of the natural pa- 
rameters: 

D^{peA\Po,) = ^{02) - ^{Oi) - (V^(^i)|^2 - ^i). 

From the last equation, we have that the mapping : 
O X O ^R, defined by D^{6i,62) = D^(jpe^\\pe.2), is a 
Bregman divergence. 

Example 3 For Gaussian distributed observations with 
unit variance and zero mean, the Bregman divergence 
can be written: 

I>G(^l,^2) = ||^l-^2||i. (7) 



4 



Joseph Salmon et al. 



Example 4 For Poisson distributed observations, the 
Bregman divergence can be written: 

Dp{ei,e2) = {exp(02)-exp(^i)|l„)-(exp(^i)|^2-^i). 

(8) 

We define the matrix Bregman divergence as 

Df{X\\Y)='P{Y)-'P{X) 

-Tr((V#(X))^(X-y)), (9) 

for any (non necessarily square) matrices X and Y of 
size M X N. 



2.3 Matrix factorization and dictionary learning 

Suppose that one observes Y G R^^^, and let Yi^. 
denote the ith patch in row-vector form. We would 
like to approximate the underlying intensity F by a 
combination of some vectors, atoms, or dictionary el- 
ements V = [^1 , • • • , v^] , where each patch uses differ- 
ent weights on the dictionary elements. In other words, 
the ith patch of the true intensity, denoted ., is ap- 
proximated as exp(ii^y), where Ui is the ith row of U 
and contains the dictionary weights for the ith patch. 
Note that we perform this factorization in the natural 
parameter space, which is why we use the exponential 
function in the formulation given in Eq. (§. 

Using the divergence defined in ^ our objective is 
to find U and V minimizing the following criterion: 

M 

D^{Y\\UV) =Y,^{u,V) - Y,, - (F,, - y,, :) . 

In the Poisson case, the framework introduced in 
p!Ql[4Q] uses the Bregman divergence in Example [4] and 
amounts to minimizing the following loss function 

M N 

mv) = Y,T.^Muv),^j -y^Auv)^,J (10) 

i=l j=l 

with respect to the matrices U and V. Defining the 
corresponding minimizers of the biconvex problem 

(/7*,V*)G argmin L{U,V) , (11) 



Remark 2 The classical PCA (of order £) is obtained 
using the Gaussian distribution, which leads to solving 



our image intensity estimate is 
F = exp(/7*V*). 



(12) 



the same minimization as in Eq. (11), except that L is 
replaced by 



M N 

Z(/7,F) = ^^(([/n,.->^M)'. 

Remark 3 The problem as stated is non-identifiable, as 
scaling the dictionary elements and applying an inverse 
scaling to the coefficients would result in an equiva- 
lent intensity estimate. Thus, one should normalize the 
dictionary elements so that the coefficients cannot be 
too large and create numerical instabilities. The easiest 
solution is to impose that the atoms Vi are normal- 
ized w.r.t. the standard Euclidean norm, i.e., for all 
i G {1, • • • one ensures that the constraint \\vi\\2 = 
Xlj=i ^ J- = 1 is satisfied. In practice though, relaxing 
this constraint modifies the final output in a negligible 
way while helping to keep the computational complex- 
ity low. 



3 Newton's method for minimizing L 

Here we follow the approach proposed by p^[35| that 
consists in using Newton steps to minimize the func- 
tion L. Though L is not jointly convex in U and V ^ 
when fixing one variable and keeping the other fixed 
the partial optimization problem is convex {i.e., the 
problem is biconvex). Therefore we consider Newton 
updates on the partial problems. To apply Newton's 
method, one needs to invert the Hessian matrices with 
respect to both U and V, defined by Hu = V}jL{U, V) 
and Hy = VyL{U^ V). Simple algebra leads to the fol- 
lowing closed form expressions for the components of 
these matrices (for notational simplicity we use pixel 
coordinates to index the entries of the Hessian): 



d^L{U,V) 
dUa,bdUc4 



N 



^exp(/7F),,,F,^^.,if (a,6) = (c,d), 
otherwise. 



and 



M 



d^L{U,V) 



^/7^,exp(/7n,6,if (a,6) = (c,d), 
otherwise, 







This is what we call Poisson-PCA (of order i) in the 
remainder of the paper. 



where both partial Hessians can be represented as di- 
agonal matrices {of. Appendix [C| for more details). 

We propose to update the rows of U and columns of 
V as proposed in ^5] . We introduce the function Vectc 
that transforms a matrix into one single column (con- 
catenates the columns), and the function Vect^^ that 
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transforms a matrix into a single row (concatenates the 
rows). Precise definitions are given in Appendix [d] The 
updating step for U and V are then given respectively 

by 

VectR{Ut+i) = VectRiUt) - Vect^ {VuL{Ut,Vt))H[^^ 
and 

Vectc(T4+i) = Vectc(T4) - Hy^^ Vectc {VvL{Ut, Vt)) . 

Simple algebra (c/. Appendix|D]or fi9\ for more details) 
leads to the following updating rules for the ith row of 
Ut+i (denoted 

^t+i,z,: = Ut,^, - {eMUtVth, - r,,:)F7(FtA^7)-' , 

(13) 

where A = diag ( exp([/^l/^)i,i, . . . , exp(/7^1/t)i,Ar) is a 
diagonal matrix of size N x N. The updating rule for 
Vt^:j^ the jth column of Vt^ is computed in a similar 
way, leading to 

{Ul,,E,Ut^,)-'Ul,,{eMUt^iVt):,j - K,,), (14) 

where Ej = diag ( exp(A+i^t)i,j, • • . ,ex.p{Ut^iVt)Mj) 
is a diagonal matrix of size M x M. More details about 
the implementation are given in Algorithm [l] 

Algorithm 1 Poisson NLPCA/ NLSPCA 

Inputs: Noisy pixels yi for i = 1, . . . , M 

Parameters: Patch size \/N x VN, number of clusters K, 

number of components i, maximal number of iterations 

A^iter 

Output: estimated image / 
Method: 

Patchization: create the collection of patches for the noisy 
image Y 

Clustering: create K clusters of patches using K-Means 
The kth cluster (represented by a matrix Y^) has Mj^ ele- 
ments 

for all cluster k do 

Initialize Uq = randn(Mfc,-^) and Vq = randn(-^, A^) 
while t < Niter and test > Sstop do 
for all z < Mfe do 

Update the ith row of U using {13} or ([17|)-([19} 
end for 
for all j < i do 

Update the jth column of V using ( |14| ) 
end for 
t\=t-\-l 
end while 

= exp(UtVt) 
end for 

Concatenation: fuse the collection of denoised patches F 
Reprojection: average the various pixel estimates due to 
overlaps to get an image estimate: / 



4 Improvements through penahzation 



A possible alternative to minimizing Eq. (10), consists 



of minimizing a penalized version of this loss, whereby 
a sparsity constraint is imposed on the elements of U 
(the dictionary coefficients). Related ideas have been 
proposed in the context of sparse PCA [48], dictionary 
learning [27 , and matrix factorization [301129] in the 
Gaussian case. Specifically, we minimize 



L^^"(/7, V) = L(U, F) + A Pen(/7), 



(15) 



where Pen(t/) is a penalty term that ensures we use 
only a few dictionary elements to represent each patch. 
The parameter A controls the trade-off between data 
fitting and sparsity. We focus on the following penalty 
function: 



Pen(C/) = \Uij 



(16) 



We refer to the method as the Poisson Non-Local Sparse 
PCA (NLSPCA). 

The algorithm proposed in [29] can be adapted with 
the SpaRSA step provided in [44], or in our setting by 
using its adaptation to the Poisson case - SPIRAL [20] . 
First one should note that the updating rule for the dic- 



tionary element, i.e., Equation (14), is not modified. 



Only the coefficient update, i.e., Equation (13) is mod- 
ified as follows: 

Ut^i. = argmin(exp(^iF,)|l) - + MMi- 

(17) 

For this step, we use the SPIRAL approach. This leads 
to the following updating rule for the coefficients: 



: arg mm 



^1 

at 



1 



(18) 



subject to 7t = Ut^: - ^VufiUt^-). 

at 

where > and the function / is defined by 
f{u) = {eMuVtm - {uVt\Yt+i,). 
The gradient can thus be expressed as 
Vf{u) = (exp(wT4+i) - 



Then the solution of the problem (18), is simply 
A 



at 



(19) 



where t^st is the soft-thresholding function r]sT{x^r) 
sign(x) • {\x\ -r)+. 
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Other methods than SPIRAL for solving the Pois- 
son £i-constrained problem could be investigated, 
e.g., Alternating Direction Method of Multipliers 
(ADMM) algorithms for ^i-minimization {cf. [45,6], 
or one specifically adapted to Poisson noise [16 ), 
though choosing the augmented Lagrangian parameter 
for these methods can be challenging in practice. 

5 Clustering step 

Most strategies apply matrix factorization on patches 
extracted from the entire image. A finer strategy con- 
sists in first performing a clustering step, and then ap- 
plying matrix factorization on each cluster. Indeed, this 
avoids grouping dissimilar patches of the image, and al- 
lows us to represent the data within each cluster with a 
lower dimensional dictionary. This may also improve on 
the computation time of the dictionary. In [lTl[3Q] , the 
clustering is based on a geometric partitioning of the 
image. This improves on the global approach but may 
results in poor estimation where the partition is too 
small. Moreover, this approach remains local and can- 
not exploit the redundancy inside similar disconnected 
regions. We suggest here using a non-local approach 
where the clustering is directly performed in the patch 
domain similarly to [9 . Enforcing similarity inside non- 
local groups of patches results in a more robust low 
rank representation of the data, decreasing the size of 
the matrices to be factorized, and leading to efficient 
algorithms. Note that in [15], the authors studied an 
hybrid approach where the clustering is driven in a hi- 
erarchical image domain as well as in the patch domain 
to provide both robustness and spatial adaptivity. We 
have not considered this approach since, while increas- 
ing the computation load, it yields to significant im- 
provements particularly at low noise levels, which are 
not the main focus of this paper. 

For clustering we have compared two solutions: one 
using only a simple K-means on the original data, and 
one performing a Poisson K-means. In similar fash- 
ion for adapting PGA for exponential families, the K- 
means clustering algorithm can also be generalized us- 
ing Bregman divergences; this is called Bregman clus- 
tering [2. This approach, detailed in Algorithm [2j 
has an EM (Expectation-Maximization) flavor and is 
proved to converge in a finite number of steps. 

The two variants we consider differ only in the 
choice of the divergence d used to compare elements 
X with respect to the centers of the clusters xc- 

— Gaussian: Uses the divergence defined in ([7|): 
dif,fc) = DGif,fc) = \\f-fc\\l 



— Poisson: Uses the divergence defined in ([8|: 
dif, fc) = Z)p(log(/), log(/c)) = J2fc- f 

j 

where the log is understood element- wise (note that 
the difference with (|8| is only due to a different 
parametrization here). 

In our experiments, we have used a small number (for 
instance K = 14) of clusters fixed in advance. 



Algorithm 2 Bregman hard clustering 

Inputs: Data points: (fi)f^i G M'^, number of clusters: K, 

Bregman divergence: ci : x ^ M+ 

Output: Clusters centers: (l^k)k'=i^ partition associated : 

Method: 

Initialize (Mfc)fcLi by randomly selecting K elements among 
repeat 

(The Assignment step: Cluster updates) 
Set Cfc := 0, 1 < A; < K 
for i= I, - - ,M do 

Ck* —Ck* U{/,} 

where k* = argmin d{fi^iiy) 
fc' = i, -- ,K 

end for 

{The Estimation step: Center updates) 
for k = 1, • • • , X do 

end for 
until convergence 



In the low-intensity setting we are targeting, clus- 
tering on the raw data may yield poor results. A pre- 
liminary image estimate might be used for performing 
the clustering, especially if one has a fast method giv- 
ing a satisfying denoised image. For instance, one can 
apply the Bregman hard clustering on the denoised im- 
ages obtained after having performed the full Poisson 
NLPCA on the noisy data. This approach was the one 
considered in the short version of this paper |36] , where 
we were using only the classical i^-means. However, we 
have noticed that using the Poisson i^-means instead 
leads to a significant improvement. Thus, the benefit of 
iterating the clustering is lowered. In this version, we do 
not consider such iterative refinement of the clustering. 
The entire algorithm is summarized in Fig. [l] 

6 Algorithmic details 

We now present the practical implementation of our 
method, for the two variants that are the Poisson 
NLPCA and the Poisson NLSPCA. 
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Collections of denoised 
patches 



FUSION 




PATCHIZATION 



REPROJECTION 



y = Poisson(/) 



/ 



Fig. 1 Visual summary of our denoising method. In this work we mainly focus on the two highlighted points of the figure: 
clustering in the context of very photon-limited data, and specific denoising method for each cluster. 



6.1 Initialization 

We initialize the dictionary at random, drawing the en- 
tries from a standard normal distribution, that we then 
normalize to have a unit Euclidean norm. This is equiv- 
alent to generating the atoms uniformly at random from 
the Euclidean unit sphere. As a rule of thumb, we also 
constrain the first atom (or axis) to be initialized as 
a constant vector. However, this constraint is not en- 
forced during the iterations, so this property can be 
lost after few steps. 



6.3 Reprojections 

Once the whole collection of patches is denoised, it 
remains to reproject the information onto the pixels. 
Among various solutions proposed in the literature (see 
for instance ^T] and ^j) the most popular, the one we 
use in our experiments, is to uniformly average all the 
estimates provided by the patches containing the given 
pixel. 

6.4 Binning-interpolating 



6.2 Stopping criterion and conditioning number 

Many methods are proposed in [44] for the stopping 
criterion. Here we have used a criterion based on the 
relative change in the objective function L^®"([/, F) 
defined in Eq. (15). This means that we iterate 
the alternating updates in the algorithm as long 
||exp(C/tFt) - exp([/t+iT4+i)||V||exp([/tT4)||2 < e^iop 
for some (small) real number Sgtop- 

For numerical stability we have added a Tikhonov 
(or ridge) regularization term. Thus, we have substi- 



tuted VtDiV^ in Eq. (ITs]) with {VtDiV^ ^e^ondh) and 



{UjEjUt) in Eq. ^ with {Uj E^Ut) ^ Scondh)- For 
the NLSPCA version the Scond parameter is only used 



to update the dictionary in Eq. (14), since the regular- 



ization on the coefficients is provided by Eq. (17). 



Following a suggestion of an anonymous reviewer, we 
have also investigated the following "binned" variant of 
our method: 

1. aggregate the noisy Poisson pixels into small (for 
instance 3x3) bins, resulting in a smaller Poisson 
image with lower resolution but higher counts per 
pixel; 

2. denoise this binned image using our proposed 
method; 

3. enlarge the denoised image to the original size using 
(for instance bilinear) interpolation. 

Indeed, in the extreme noise level case we have consid- 
ered, this approach significantly reduces computation 
time, and for some images it yields a significant perfor- 
mance increase. The binning process allows us to im- 
plicitly use larger patches, without facing challenging 



8 



Joseph Salmon et al. 



memory and computation time issues. Of course, such 
a scheme could be applied to any method dealing with 
low photon counts, and we provide a comparison with 
the BM3D method (the best overall competing method) 
in the experiments section. 



7 Experiments 

We have conducted experiments both on simulated and 
on real data, on grayscale images (2D) and on spectral 
images (3D). We summarize our results in the following, 
both with visual results and performance metrics. 



7.1 Simulated 2D data 



Parameter Definition 



Value 




Cameraman Pepper Bridge Ridges 

Fig. 2 Original images used for our simulations. 

We have first conducted comparisons of our method 
and several competing algorithms on simulated data. 
The images we have used in the simulations are pre- 
sented in Fig. |2] We have considered the same noise 
level for the Saturn image (c/. Fig. [8| as in [4T1, where 
one can find extensive comparisons with a variety of 
multiscale methods [23,42 ,24 . 

In terms of PSNR, defined in the classical way (for 
8-bit images) 



PSNR(/,/) = 101ogio 



2552 



fi) 



2' 



(20) 



our method globally improves upon other state-of-the- 
art methods such as Poisson-NLM [14 , SAFIR [5 , and 
Poisson Multiscale Partitioning (PMP) [42] for the very 
low light levels of interest. Moreover, visual artifacts 
tend to be reduced by our Poisson NLPCA and NL- 
SPCA, with respect to the version using an Anscombe 
transform and classical PC A (c/. AnscombeNLPCA in 
Figs, [s] and [g] for instance). See Section 7.4 for more 
details on the methods used for comparison. 



N 


patch size 


20 X 20 


i 


approximation rank 


4 


K 


clusters 


14 


Niter 


iteration limit 


20 


^stop 


stopping tolerance 


10-1 


^cond 


conditioning parameter 


10-3 


A 


ii regularization 


/log(Mfc) 
V n 


(NL-SPCA only) 



Table 1 Parameter settings used in the proposed method. 
Note: Mfc is the number of patches in the kth cluster as de- 
termined by the Bregman hard clustering step. 



All our results for 2D and 3D images are provided 
for both the NLPCA and NLSPCA using (except oth- 
erwise stated) the parameter values summarized in Ta- 
ble [1] The step-size parameter at for the NL-SPCA 
method is chosen via a selection rule initialized with 
the Barzilai-Borwein choice, as described in |2Q|. 



7.2 Simulated 3D data 

In this section we have tested a generalization of our 
algorithm for spectral images. We have thus considered 
the NASA AVIRIS (Airborne Visible/Infrared Imaging 
Spectrometer) MoflFett Field reflectance data set, and 
we have kept a 256 x 256 x 128 sized portion of the to- 
tal data cube. For the simulation we have used the same 
noise level as in [25] (the number of photons per voxel 
is 0.0387), so that comparison could be done with the 
results presented in this paper. Moreover to ease com- 
parison with earlier work, the performance has been 
measured in terms of mean absolute error (MAE), de- 
fined by 



MAE(/,/) = 



11/ -/111 



(21) 



We have performed the clustering on the 2D image 
obtained by summing the photons on the third (spec- 
tral) dimension, and using this clustering for each 3D 
patch. This approach is particularly well suited for low 
photons counts since with other approaches the cluster- 
ing step can be of poor quality. Our approach provides 
an illustration of the importance of taking into account 
the correlations across the channels. We have used non- 
square patches since the spectral image intensity has 
different levels of homogeneity across the spectral and 
spatial dimensions. We thus have considered elongated 
patches with respect to the third dimension. In prac- 
tice, the patch size used for the results presented is 
5 X 5 X 23, the number of clusters is = 30, and the 
order of approximation is ^ = 2. 
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For the noise level considered, our proposed al- 
gorithm outperforms the other methods, BM4D ^28] 
and PMP both visually and in term of MAE 

(c/. Fig. [9|. Again, these competing methods are de- 
scribed in Section [73J 



Standard deviation of Anscombe transformed Poisson data (10^ samples) 



7.3 Real 3D data 

We have also used our method to denoise some real 
noisy astronomical data. The last image we have con- 
sidered is based on thermal X-ray emissions of the 
youngest supernova explosion ever observed. It is the 
supernova remnant Gl.9+0.3 (@ NASA/CXC/SAO) in 
the Milky Way. The study of such spectral images can 
provide important information about the nature of ele- 
ments present in the early stages of supernova. We refer 
to [4 for deeper insights on the implications for astro- 
nomical science. This dataset has an average of 0.0137 
photons per voxel. 

For this image we have also used the 128 first 
spectral channels, so the data cube is also of size 
256 X 256 X 128. Our method removes some of the spuri- 
ous artifacts generated by the method proposed in [25j 
and the blurry artifacts in BM4D [28] . 



7.4 Comparison with other methods 

7.4-1 Classical PCA with Anscombe transform 

The approximation of the variance provided by the 
Anscombe transform is reasonably accurate for intensi- 
ties of three or more (c/. Fig.jsjand also [32] Fig. 1-b). In 
practice this is also the regime where a well-optimized 
method for Gaussian noise might be applied success- 
fully using this transform and the inverse provided in 

To compare the importance of fully taking advan- 
tage of the Poisson model and not using the Anscombe 
transform, we have derived another algorithm, analo- 
gous to our Poisson NLPCA method but using Bregman 
divergences associated with the natural parameter of a 
Gaussian random variable instead of Poisson. It cor- 
responds to an implementation similar to the classical 
power method for computing PCA [10 . The function L 



to be optimized in (10) is simply replaced by the square 
loss L, 



M N 




Fig. 3 Standard deviation approximation of some simu- 
lated Poisson data, after performing the Anscombe transform 
(Ansc). For each true parameter /, 10^ Poisson realizations 
where drawn and the corresponding standard deviation is re- 
ported. 



For the Gaussian case, the following update equations 



are substituted for (13) and (14) 



and 



mVth,-Y,,)V,~'{VtV,~')-' , (23) 



{Ul,,Ut^r)-'ul,, m^iVt):,j - Yj) . (24) 



Z([/,F) = ^^((/7n,.->^.,.f 



(22) 



=1 j=i 



An illustration of the improvement due to our di- 
rect modeling of Poisson noise instead of a simpler 
Anscombe (Gaussian) NLPCA approach is shown in 
our previous work [36 and the below simulation re- 
sults. The gap is most noticeable at low signal-to- noise 
ratios, and high-frequency artifacts are more likely to 
appear when using the Anscombe transform. To invert 
the Anscombe transform we have considered the func- 
tion provided by [3T], and available at http://www. 
cs.tut.fi/~foi/invansc/, This slightly improves the 
usual (closed form) inverse transformation, and in our 
work it is used for all the methods using the Anscombe 
transform (referred to as Anscombe-NLPCA in our ex- 
periments). 

7.4-2 Other methods 

We compare our method with other recent algorithms 
designed for retrieval of Poisson corrupted images. In 
the case of 2D images we have compared with: 

— NLBayes [26] using Anscombe transform and the 
refined inverse transform proposed in [31) . 

— SAFIR [22| [5]. using Anscombe transform and the 
refined inverse transform proposed in [3T] . 
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Poisson multiscale partitioning (PMP), introduced 
by Willett and Nowak [42 ','43! using full cycle spin- 
ning. We use the haarTIApprox function as avail- 
able at |http : //people . ee . duke . edu/^willett 



BM3D [3T1 using Anscombe transform with a re- 
fined inverse transform. The online code is available 
at http: //www. cs .tut .f i/~f oi/invansc/ and we 



used the default parameters provided by the au- 
thors. The version with binning and interpolation 
relies on 3 x 3 bins and bilinear interpolation. 

In the case of spectral images we have compared our 
proposed method with 

— BM4D [28] using the inverse Anscombe [31] already 
mentioned. We set the patch size to 4 x 4 x 16, since 
the patch length has to be dyadic for this algorithm. 

— Poisson multiscale partition (PMP for 3D images) 
[25] . adapting the haarTIApprox algorithm to the 
case of spectral images. As in the reference men- 
tioned, we have considered cycle spinning with 2000 
shifts. 

For visual inspection of the qualitative performance 
of each approach, the results are displayed on Fig. IpQ 



Quantitative performance in terms of PSNR are given 
in Tab. [2 



8 Conclusion and future work 

Inspired by the methodology of [15] we have adapted a 
generalization of the PC A [Tm i3? for denoising images 
damaged by Poisson noise. In general, our method finds 
a good rank-£ approximation to each cluster of patches. 
While this can be done either in the original pixel 
space or in a logarithmic "natural parameter" space, 
we choose the logarithmic scale to avoid issues with 
nonnegativity, facilitating fast algorithms. One might 
ask whether working on a logarithmic scale impacts 
the accuracy of this rank-^ approximation. Comparing 
against several state-of-the-art approaches, we see that 
because our approach often works as well or better than 
these alternatives, the exponential formulation of PCA 
does not lose significant approximation power or else it 
would manifest itself in these results. 

Possible improvements include adapting the number 
of dictionary elements used with respect to the noise 
level, and proving a theoretical convergence guarantees 
for the algorithm. The nonconvexity of the objective 
may only allow convergence to local minima. An open 
question is whether these local minima have interesting 
properties. Reducing the computational complexity of 
NLPCA is a final remaining challenge. 
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Appendix 

A Biconvexity of loss function 

Lemma 1 The function L is biconvex with respect to (U, V) but 
not jointly convex. 

Proof The biconvexity argument is straightforward; the par- 
tial functions U ^ L{U, V) with a fixed V and V ^ L{U, V) 
with a fixed U are both convex. The fact that the problem 
is non-joint ly convex can be seen when U and V are in M 
{i.e., i = m = n = 1), since the Hessian in this case is 



Hl{U,V) : 



V^e^ 



UVe^ 



Y 



UVe'- 



Y 



1 



Thus at the origin one has if (0,0) = (^-j^ gj? which has a 
negative eigenvalue, —1. 



B Gradient calculations 

We provi de b elow the gradient computation used in Eq. ( |13| ) 
and Eq. ([l4|: 

VuL{U, V) = (exp(C/y) - Y)V^ , 
VvL{U, V) = [/^(exp([/y) - Y) . 

Using the component-wise representation this is equiva- 
lent to 



dL{U, V) 
dL{U, V) 



= ^exp(t/y)«,,y,,, 



M 



dVr. 



a,b 



C Hessian calculations 

The approach proposed by p!9l[35] consists in using an itera- 
tive algorithm which sequentially updates the jth column of 
V and the ith row of U. The only problem with this method 
is numerical: one needs to invert possibly ill conditioned ma- 
trices at each step of the loop. 



Poisson noise reduction with non-local PCA 



11 




(a) Original 



(b) Noisy, PSNR=0.31 (c) haarTIApprox, (d) SAFIR, (e) BM3D, 

PSNR=18.69 PSNR=17.87 PSNR=19.30 




(f) AnscombePCA, (g) 



NLPCA, (h) 



NLPCAS, (i) 




BMSDbin, (j) 



PSNR=18.08 



PSNR=19.26 



PSNR=18.91 



PSNR=18.99 



NLPCASbin, 
PSNR=23.27 



Fig. 4 Toy cartoon image (Ridges) corrupted with Poisson noise with Peak = 0.1. 





A 




y 




(a) Original (b) Noisy, (c) haarTIApprox, (d) SAFIR, (e) BM3D, 

PSNR=10.00 PSNR=24.44 PSNR=24.92 PSNR=26.30 




(f) AnscombePCA, (g) NLPCA, (h) NLPCAS, 

PSNR=28.29 PSNR=30.75 PSNR=30.10 



(i) BM3Dbin, (j) NLPCASbin, 

PSNR=30.45 PSNR=28.32 



Fig. 5 Toy cartoon image (Ridges) corrupted with Poisson noise with Peak = 1. 



The Hessian matrices of our problems, with respect to U 
and V respectively are given by 



d'^L{U,V) 
dUa,bdUc,d 

and 

d'^L(U,V) 
dVa.bdV^.d ' 



J2^MUV)ajV,^j, if (a,6) = (c,d), 
j = i 

otherwise, 



^/72^exp(/7y),,b, if (a, 6) = (c,d), 
otherwise. 



i=l 





Notice that both Hessian matrices are diagonal. So applying 
the inverse of the Hessian simply consists in inverting the 
diagonal coefficients. 



D The Newton step 

In the following we need to introduce the function Vect^ that 
transforms a matrix into one single column (concatenates the 
columns), and the function Vectj^ that transforms a matrix 
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(a) Original (b) Noisy, PSNR=- (c) haarTIApprox, (d) SAFIR, (e) BM3D, 

7.11 PSNR=10.97 PSNR=12.04 PSNR=12.92 




(f) AnscombePCA, (g) NLPCA, (h) NLPCAS, (i) BMSDbin, (j) NLPCASbin, 

PSNR=13.18 PSNR-14.35 PSNR=14.40 PSNR=13.91 PSNR=15.99 



Fig. 6 Toy cartoon image (Flag) corrupted with Poisson noise with Peak = 0.1. 



(a) Original (b) Noisy, PSNR=2.91 (c) haarTIApprox, (d) SAFIR, (e) BM3D, 

PSNR=17.82 PSNR=17.91 PSNR=18.54 







1 1 




















1 




(f) AnscombePCA, (g) NLPCA, (h) NLPCAS, 


(i) BM3Dbin, 


(j) NLPCASbin, 



PSNR=19.94 PSNR=20.26 PSNR=20.37 PSNR=19.45 PSNR=17.12 

Fig. 7 Toy cartoon image (Flag) corrupted with Poisson noise with Peak = 1. 



Now using the previously introduced notations, the up- 
dating steps for U and V can be written 

Vectc(f/t+i) = Vectc(Ut) - H'^ Vectc {VuL{Ut,Vt)) , (25) 
VectH(Vt+i) = YectR{Vt) - YectR {VvL{Ut,Vt))H-^ . (26) 

We give the order used to concatenate the coefficients 
for the Hessian matrix with respect to t/, Hu'- (ct^b) = 
(1,1),--- ,(M, 1),(1,2),---(M,2),---(1,^),--- ,(M,^). 
We concatenate the column of U in this order. 

It is easy to give the updating rules for the kth column of 
one only needs to multiply the first Equation of (|25|) from 




into a single row (concatenates the rows). This means that 



Vectc 



and 
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(a) Original, channel 68 (b) Noisy data (c) BM4D, 4 x 4 x 16 (d) Multiscale parti- (e) NLSPCA, 5 x 5 x 23, 

MAE=0.2426 tion, MAE=0.1937 MAE=0.1893 




(f) Original, channel 68 (g) Noisy data (h) BM4D, 4 x 4 x 16, (i) Multiscale partition, (j) NLSPCA, 5 x 5 x 23, 

MAE=0.2426 MAE=0.1937, MAE=0.1893 

Fig. 9 Original and close-up of the red square from spectral band 68 of the Moffett Field. The same methods are considered, 
and are displayed in the same order: original, noisy (with 0.0387 photons per voxels), BM4D ^28^ (with inverse Anscombe as 
in [31]), multiscale partitioning method 25 , and our proposed method with patches of size 5 x 5 x 23. 

the left by the M x Mi matrix where is a diagonal matrix of size M x M: 



Fk,M,^, = {^M,M^ • • • ' ^M,M, • • • , Om,m) (27) Dk = diag ^ exp(UtVt)i 



where the identity block matrix is in the kth position. This ^ 
leads to the following updating rule exp(f^*^t)M,j 

Ut^i,-,k = Ut,:,k - D-\exp{UtVt) - y)V,yfe,: , (28) This leads easily to ([l3}. 
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Method 


Swoosh 


Saturn 


Flag 


House 


Cam 


Pepper 


Bridge 


Ridges 










Peak : 


= 0.1 








NLBayes 


11.08 


12.65 


7.14 


10.94 


10.54 


11.52 


10.58 


15.97 


haarTIApprox 


19.84 


19. S6 


12.72 


18.15 


17.18 


19.10 


16.64 


18.68 


SAFIR 


18.88 


20. S9 


12.24 


17.45 


16.22 


18. 5S 


16.55 


17.97 


BM3D 


17.21 


19. IS 


IS. 12 


16. 6S 


15.75 


17.24 


15.72 


19.47 


BMSDbin 


21.91 


20.82 


14. S6 


18.S9 


17.11 


18.84 


16.94 


20. SS 


NLPCA 


19.12 


20.40 


14.45 


18.06 


16.58 


18.48 


16.48 


21.25 


NLSPCA 


19.18 


20.45 


14.50 


18.08 


16.64 


18.49 


16.52 


20.56 


NLSPCAbin 


21.56 


19.47 


15.57 


18.68 


17.29 


18. 7S 


16.90 


23.52 










Peak : 


= 0.2 








NLBayes 


14.18 


14.75 


8.20 


IS. 54 


12.71 


IS. 89 


12.59 


16.19 


haarTIApprox 


21.55 


20.91 


IS. 97 


19.25 


18.37 


20. IS 


17.46 


20.46 


SAFIR 


20.86 


21.71 


IS. 65 


18.8S 


17.S8 


19.88 


17.41 


18.58 


BM3D 


20.27 


21.20 


14.25 


18.67 


17.44 


19. SI 


17.14 


21.10 


BMSDbin 


24.14 


22.59 


16.04 


19.93 


18.24 


20.22 


17.66 


2S.92 


NLPCA 


21.20 


22.29 


16. 5S 


19.08 


17.80 


19.69 


17.49 


24.10 


NLSPCA 


21.27 


22. S4 


16.47 


19.11 


17.77 


19.70 


17.51 


24.41 


NLSPCAbin 


24.04 


20.56 


16.65 


19.87 


17.90 


19.61 


17. 4S 


25.43 










Peak = 


= 0.5 








NLBayes 


19.60 


18.28 


10.19 


17.01 


15.68 


16.90 


15.11 


16.77 


haarTIApprox 


2S.59 


2S.27 


16.25 


20.65 


19.59 


21. SO 


18. S2 


2S.07 


SAFIR 


22.70 


24. 2S 


16.20 


20.S7 


18.84 


21.25 


18.42 


20.90 


BM3D 


2S.5S 


24.09 


15.94 


20.50 


18.86 


21. OS 


18. S7 


2S.SS 


BMSDbin 


26.20 


25.64 


18. 5S 


21.70 


19.58 


21.60 


18.75 


27.99 


NLPCA 


24.50 


25. S8 


18.93 


20.78 


19. S6 


21. IS 


18.47 


28.06 


NLSPCA 


24.44 


25.06 


18.92 


20.76 


19. 2S 


21.12 


18.46 


28. OS 


NLSPCAbin 


26.36 


20.67 


17.09 


20.97 


18. S9 


20.28 


18.16 


26.81 










Peak 










NLBayes 


2S.58 


21.66 


14.00 


19.27 


17.99 


19.48 


16.85 


18. S5 


haarTIApprox 


25.12 


25.06 


17.79 


21.97 


20.64 


22.25 


19.08 


24.52 


SAFIR 


2S.S7 


25.14 


17.91 


21.46 


20.01 


22.08 


19.12 


24.67 


BM3D 


26.21 


25.88 


18.45 


22.26 


20.45 


22.27 


19. S9 


25.76 


BMSDbin 


27.95 


27.24 


19.49 


23.26 


20.61 


22.53 


19.47 


29.91 


NLPCA 


26.99 


27.08 


20. 2S 


22.07 


20. SI 


21.96 


19.01 


30.17 


NLSPCA 


27.02 


27.04 


20.37 


22.10 


20.28 


21.88 


19.00 


SO. 04 


NLSPCAbin 


27.21 


21.10 


17. OS 


21.21 


18.45 


20. S7 


18. S6 


26.96 










Peak 


= 2 








NLBayes 


27.50 


24.66 


17.1S 


21.10 


19.67 


21. S4 


18.22 


21.04 


haarTIApprox 


27.01 


26.4S 


19. SS 


2S.S7 


21.72 


2S.18 


19.90 


26. 5S 


SAFIR 


2S.78 


26.02 


19.25 


22. SS 


21. SO 


22.74 


19.99 


28.29 


BMSD 


28. 6S 


27.70 


20.66 


24.25 




23.54 


20.44 


29.75 


BMSDbin 


29.70 


28.68 


20.01 


24.52 


21.42 


2S.4S 


20.17 


S2.24 


NLPCA 


29.41 


28.02 


20.64 


2S.44 


20.75 


22.78 


19. S7 


S2.25 


NLSPCA 


29. 5S 


28.11 


20.75 


2S.75 


20.76 


22.86 


19.45 


32.35 


1\TT Q TD A 


27.62 


21. IS 


17.02 


21.42 


18. SS 


20. S4 


18. S4 


29. SI 










Peak = 


= 0.14 








NLBayes 


SI. 17 


26. 7S 


22.64 


2S.61 


22. S2 


2S.02 


19.60 


24.04 


haarTIApprox 


28.55 


28. IS 


21.16 


24.88 


22. 9S 


24. 2S 


20. 8S 


28.56 


SAFIR 


25.40 


27.40 


20.71 


2S.76 


22. 7S 


2S.85 


20.88 


SO. 52 


BMSD 


S0.S6 


29. SO 


22.91 


26.08 


23.93 


24.79 


21.50 


S2.50 


BMSDbin 


SI. 15 


30.07 


20.57 


25.64 


22.00 


24.28 


20.84 


SS.52 


NLPCA 


SI. 08 


29.07 


20.96 


24.49 


20.96 


2S.18 


19. 7S 


33.73 


NLSPCA 


31.46 


29.51 


21.15 


24.89 


21.08 


2S.41 


20.15 


SS.69 


NLSPCAbin 


27.65 


21.45 


16.00 


21.47 


18.44 


20. S5 


18. S5 


29. IS 



Table 2 Experiments on simulated data (average over five noise realizations). Flag and Saturn images are displayed in Figs.[8| 
|6]and[7| and the others are given in 38 and in [46] . 



By the symmetry of the problem in U and y, one has the 
following equivalent updating rule for V\ 

= - Ul,^^{^y,v{VtVt) - Y)El^M ' (29) 
where is a diagonal matrix of size N x N: 

M 

Ek = diag (^^e^p{UtVt)i,iUl- j^, ■ ■ ■ , 

i=l 

n 

i = i 
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0.0137 photons per voxels), and this denoised channel with BM4D f28^ (with inverse Anscombe as in 31 ), the multiscale 
partitioning method [25], and our proposed method NLSPCA with patches of size 5 x 5 x 23. Note how the highlighted detail 
shows structure in the average over channels, which appears to be accurately reconstructed by our method. 
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